Data Visualization Project 2: Analysis of the World Happiness dataset

Authors

Caroline Graebel, Nina Immenroth, Bogdan Kostić, Naim Zahari

1 Top-level analysis of the data

To explore interesting structures in the data, we will first use basic variable plotting around the main variable “happiness_score” to get a first feeling for the data. Afterwards, we will look into how a random forest model rates the importance of the variables when predicting the happiness score. Lastly, a K-means model is used to look into a possible clustering of the data and how the resulting clusters can be interpreted.
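The feature-importance step can be sketched as follows. This is an illustrative sketch only: the data is synthetic, the column names (`gdp_per_capita`, `generosity`) are assumptions, and scikit-learn is one possible implementation, not necessarily the one used in the report.

```python
# Sketch: ranking predictor importance with a random forest.
# Synthetic stand-in data; NOT the World Happiness dataset.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 200
gdp = rng.normal(size=n)          # strong synthetic predictor
generosity = rng.normal(size=n)   # uninformative synthetic predictor
happiness = 0.8 * gdp + 0.1 * rng.normal(size=n)

X = np.column_stack([gdp, generosity])
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, happiness)

# Importances sum to 1; higher means the variable reduces more impurity.
for name, imp in zip(["gdp_per_capita", "generosity"], model.feature_importances_):
    print(f"{name}: {imp:.3f}")
```

Since the happiness score is continuous, a regressor is used here; a classifier would require discretizing the score first.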

1.1 Introducing the variables

INSERT VARIABLE MEANINGS HERE

1.2 Looking at distribution of the happiness score

Figure 1: Distribution graph of the happiness score.

When looking at our variable of interest, it’s clearly visible that the distribution is quite close to a normal distribution.

1.3 Looking into relationship of the happiness score to predictors

1.3.1 Region

Figure 2: Boxplot Happiness Score based on different regions.

When looking at how the happiness score is connected to the regions of the countries we have data for, we can see that the regions differ a lot in their spread. Western Europe and North America and ANZ have the comparatively highest median happiness scores. In contrast, Sub-Saharan Africa and South Asia have the lowest median happiness. The regions also differ strongly in their variance: the interquartile range (the range between the first and third quartiles) is largest for the Middle East and North Africa, while for Sub-Saharan Africa and North America and ANZ the boxes are very small.
Region is used as an aggregation of the variable country, as the number of countries covered is too large to visualize them individually. We look into countries that show interesting patterns later in the report.

(a) Happiness Score over GDP per Capita
(b) Happiness Score over Social Support
(c) Happiness Score over Healthy Life Expectancy
(d) Happiness Score over Freedom to make Life Choices
(e) Happiness Score over Generosity
(f) Happiness Score over Perceptions of Corruption
Figure 3: Scatterplots for relationship of happiness score and different predictor variables.
Figure 4: Correlation matrix plot.

1.3.2 GDP per Capita

There seems to be a positive linear relationship between GDP per capita and the happiness score: with higher GDP, the happiness score rises. The relationship is stronger than the plot suggests, as the scales of the x- and y-axes differ considerably. Calculating the correlation provides further evidence for the linear relationship between the variables.

Correlation of the happiness score and GDP per capita: 0.7238105

The correlation of these variables, which can also be seen in Figure 4, is very high with a value of 0.72.
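The correlations reported throughout this section are Pearson correlations, which can be computed as sketched below. The data and column names here are illustrative assumptions, not the report's actual data frame.

```python
# Sketch: Pearson correlation between happiness score and GDP per capita.
# Toy data; column names are assumed, not from the report.
import pandas as pd

df = pd.DataFrame({
    "happiness_score": [4.2, 5.1, 6.3, 7.0, 5.8],
    "gdp_per_capita": [0.8, 1.0, 1.3, 1.5, 1.1],
})
# Series.corr computes the Pearson correlation by default and
# skips missing values pairwise.
r = df["happiness_score"].corr(df["gdp_per_capita"])
print(round(r, 2))
```

Pairwise handling of missing values matters here, since healthy life expectancy and perceptions of corruption have missing entries.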

1.3.3 Social Support

Similar to GDP per Capita, there is a positive linear relationship between both variables.

Correlation of the happiness score and social support: 0.6481553

The correlation between social support and the happiness score is not quite as strong as for GDP per capita, but it is still strong.

1.3.4 Healthy life expectancy

Here, we again have a positive linear relationship between the two variables. Interestingly, for healthy life expectancy some values are missing.

Correlation of the happiness score and healthy life expectancy: 0.6823998

There is a strong positive correlation between healthy life expectancy and the happiness score.

1.3.5 Freedom to make life choices

In this case, there also is a positive linear relationship between freedom to make life choices and happiness score, even though it seems a bit weaker than with the variables discussed so far.

Correlation of the happiness score and freedom to make life choices: 0.5694581

In general, there is a strong positive correlation between the variables but it is the weakest so far.

1.3.6 Generosity

No relationship between the two variables is visible here.

1.3.7 Perceptions of corruption

For perceptions of corruption, there seems to be a slight upward trend in the happiness score for higher perceptions of corruption, but only a small fraction of the data points score high on this variable. There is also missing data for this variable.

Correlation of the happiness score and Perceptions of corruption: 0.4150709

Calculating the correlation shows that there is again a positive correlation between the two variables, even if it is weak compared to the other variables examined so far.

1.3.8 Year

Figure 5: Boxplot Happiness Score over Years.

Looking at the medians, there is a slight upward trend over the years, with 2020 to 2022 having little variance compared to the other years. There are also negative outliers from 2021 to 2023.

2 Using K-means Clustering to find interesting patterns

The goal is to use K-means clustering to find a pattern that is interpretable and gives further context to the relationship between the happiness score and the other variables. The procedure is first introduced, then the data is scaled for optimal performance, and a suitable number of clusters is chosen using the elbow criterion to sensibly minimize the total within-cluster sum of squares (WSS), i.e. the sum of squared distances between each data point and the centroid of its cluster. At the end, a final model is trained and the result is plotted.
It is important to mention that K-means can only be used with continuous variables, so country, region, and year are not considered here.
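The scaling step can be sketched as follows: the categorical columns are dropped and the continuous variables are standardized to mean 0 and standard deviation 1, so that no variable dominates the distance computation. Column names here are illustrative assumptions.

```python
# Sketch: dropping categorical columns and standardizing the rest
# before K-means. Toy data; column names are assumed.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "country": ["A", "B", "C", "D"],       # categorical: dropped
    "happiness_score": [4.0, 5.5, 6.5, 7.0],
    "gdp_per_capita": [0.7, 1.1, 1.4, 1.6],
})
numeric = df.drop(columns=["country"])
# After scaling, each column has mean 0 and (population) std 1.
scaled = StandardScaler().fit_transform(numeric)
print(scaled.round(3))
```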

2.1 Introduction K-Means

K-means clustering is an unsupervised machine learning method that can help structure data that has no label a model could be trained on. This is made possible by a distance-based approach that fits the clusters so that the distances between the data points within a cluster are minimal. The distance in this case is a measure of similarity between data points; in other words, we want to fit clusters so that the points contained in one cluster are as similar as possible.
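As a minimal sketch of this idea, fitting K-means on two well-separated synthetic point clouds recovers them as clusters. The data here is synthetic and scikit-learn is one possible implementation, not necessarily the report's.

```python
# Sketch: K-means assigns each point to its nearest centroid and
# re-estimates centroids until the assignments stabilize.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Two synthetic blobs standing in for "less happy" vs "happier" groups.
low = rng.normal(loc=-2.0, scale=0.5, size=(50, 2))
high = rng.normal(loc=2.0, scale=0.5, size=(50, 2))
X = np.vstack([low, high])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# Cluster sizes; well-separated blobs are recovered cleanly.
print(sorted(np.bincount(km.labels_).tolist()))
```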

2.2 Trying PCA to improve K-Means performance

A good way of improving the performance of K-means is applying PCA to the data, as it can aggregate the information. However, PCA is only useful if a few principal components explain around 80% of the variance.

Figure 6: Cumulative proportion of the variance explained by each principal component.

As can be seen, it would be necessary to use four to five of the seven variables to cover a sufficient amount of variance. Since this is not very helpful, and each principal component explains a similar amount of variance, no PCA is applied before K-means.
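The cumulative-variance check behind Figure 6 can be sketched as below. The data is synthetic (seven roughly independent variables, so no component dominates, mirroring the situation described above); it is not the report's dataset.

```python
# Sketch: cumulative proportion of variance explained by the
# principal components. Synthetic data with 7 variables.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 7))  # 7 roughly independent variables

# explained_variance_ratio_ is sorted in decreasing order;
# its cumulative sum reaches 1 at the last component.
cum = PCA().fit(X).explained_variance_ratio_.cumsum()
print(cum.round(2))
```

When the curve rises almost linearly, as here, PCA offers little compression and skipping it is a reasonable choice.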

2.3 Finding a good amount of clusters

In general, the higher the number of clusters, the more similar the points within one cluster will be. However, the model also gets harder to interpret and messy. So, using the elbow criterion, the hyperparameter k, which equals the number of clusters, is chosen so that the WSS is sensibly minimized while the result remains interpretable.
We compute the total within-cluster sum of squares for two to ten clusters.

Figure 7: Within sum of squares over different amounts of clusters.

Using the elbow criterion, it can be seen that after k = 2 the decrease in the within-cluster sum of squares is no longer as strong. Therefore, the final K-means model is trained with two clusters.
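The elbow computation can be sketched as follows: fit K-means for each candidate k and record the total within-cluster sum of squares (exposed as `inertia_` in scikit-learn). Synthetic two-blob data is used here, so the elbow lands at k = 2 as in the report.

```python
# Sketch: within-cluster sum of squares (inertia) for k = 2..10.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
X = np.vstack([
    rng.normal(loc=-2.0, scale=0.5, size=(60, 2)),
    rng.normal(loc=2.0, scale=0.5, size=(60, 2)),
])

# WSS shrinks as k grows; the "elbow" is where the drop flattens.
wss = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in range(2, 11)}
for k, v in wss.items():
    print(k, round(v, 1))
```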

2.4 Resulting plots

Figure 8: Matrix of scatterplots coloured by cluster.

When looking at the scatterplots, the first thing to note is that the happiness score shows the cleanest split between the clusters: it essentially separates the (scaled) happiness scores greater than 0 from those below. In other words, the clusters are strongly informed by whether the happiness score lies in the upper half of the distribution or in the lower half.

Median of the happiness score: -0.0004473365

For further context, the median value of the (scaled) happiness score is almost exactly zero. The initial plots showed that the happiness score correlates with all variables plotted here except for generosity. From the matrix of scatterplots we can gather that all plots within the matrix show the same pattern: the red cluster lies in the lower left and the black cluster in the upper right. This is consistent with the positive correlations observed earlier: countries with higher values in these variables tend to belong to the cluster with the higher happiness scores.